
    Towards a Software Transactional Memory for heterogeneous CPU-GPU processors

    Heterogeneous Accelerated Processing Units (APUs) integrate a multi-core CPU and a GPU within the same chip. Modern APUs provide the programmer with platform atomics, which allow the CPU cores and the GPU to communicate through simple atomic datatypes. However, ensuring consistency for complex data types is a task delegated to programmers, who have to implement a mutual exclusion mechanism. Transactional Memory (TM) is an optimistic approach to implementing mutual exclusion. With TM, shared data can be accessed speculatively by multiple computing threads, but changes become visible only if a transaction finishes without conflicting with the memory accesses of other transactions. TM has been studied and implemented in software and hardware for both CPU and GPU platforms, but an integrated solution has not been provided for APU processors. In this paper we present APUTM, a software TM designed to work on heterogeneous APU processors. The design of APUTM focuses on minimizing accesses to shared metadata in order to reduce the communication overhead incurred by expensive platform atomics. The main objective of APUTM is to help us understand the tradeoffs of implementing a software TM on a heterogeneous CPU-GPU platform and to identify the key aspects to be considered in each device. In our experiments, we compare the adaptability of APUTM when executing on one of the devices (CPU or GPU) or on both of them simultaneously. These experiments show that APUTM is able to outperform sequential execution of the applications. This work has been supported by projects TIN2013-42253-P and TIN2016-80920-R, from the Spanish Government, P11-TIC8144 and P12-TIC1470, from Junta de Andalucía, and Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
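
    A minimal sketch of the optimistic read-compute-commit pattern that platform atomics make possible, written with standard C++ atomics; this illustrates the general idea only, not APUTM's metadata design, and the atomic_transaction helper is a name invented for this example:

        #include <atomic>
        #include <cstdio>
        #include <thread>
        #include <vector>

        std::atomic<long> shared_counter{0};  // stands in for transactional shared data

        // Speculatively read the shared word, compute a new value, and try to
        // commit it with a compare-and-swap; a failed CAS means another thread
        // committed a conflicting update in the meantime, so we retry.
        template <typename F>
        void atomic_transaction(F f) {
            long expected = shared_counter.load(std::memory_order_acquire);
            while (!shared_counter.compare_exchange_weak(
                       expected, f(expected),
                       std::memory_order_acq_rel, std::memory_order_acquire)) {
                // 'expected' was refreshed by the failed CAS; retry with it
            }
        }

        int main() {
            std::vector<std::thread> workers;
            for (int i = 0; i < 4; ++i)
                workers.emplace_back([] {
                    for (int j = 0; j < 100000; ++j)
                        atomic_transaction([](long v) { return v + 1; });
                });
            for (auto& w : workers) w.join();
            std::printf("%ld\n", shared_counter.load());  // prints 400000
        }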

    Improvements in Hardware Transactional Memory for GPU Architectures

    In the multi-core CPU world, transactional memory (TM) has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a high number of computing threads, organized in SIMT fashion, requires an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory. The local memory space serves as a shared scratch-pad for a subset of the computing threads, and programmers use it to speed up their applications thanks to its low latency. Prior work from the authors proposed a lightweight hardware TM (HTM) support based on the local memory, modifying the SIMT execution model and adding a conflict detection mechanism. An efficient implementation of these features is key to providing an effective synchronization mechanism at the local memory level. After a brief description of the main features of our HTM design for GPU local memory, in this work we gather together a number of proposals designed to improve those mechanisms with high impact on performance. Firstly, the SIMT execution model is modified to increase the parallelism of the application when transactions must be serialized in order to make forward progress. Secondly, the conflict detection mechanism is optimized depending on application characteristics, such as the read/write sets, the probability of conflict between transactions, and the existence of read-only transactions. As these features can be present in hardware simultaneously, it is the task of the compiler and runtime to determine which ones are more important for a given application. This work includes a discussion of the analysis to be done in order to choose the best configuration. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
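
    As a rough software illustration of how conflict detection can exploit read/write sets, the sketch below keeps Bloom-filter-style read and write signatures per transaction: two transactions conflict only when a write signature intersects the other's read or write signature, so read-only pairs never conflict. The 64-bit width and the hash function are illustrative choices for this example, not the hardware design:

        #include <bitset>
        #include <cstddef>
        #include <cstdint>
        #include <cstdio>

        // Bloom-filter-style signature summarizing a transaction's accesses.
        struct Signature {
            std::bitset<64> bits;  // illustrative width, not the hardware's
            static std::size_t hash(std::uintptr_t a) { return (a >> 2) % 64; }
            void add(std::uintptr_t addr) { bits.set(hash(addr)); }
            bool intersects(const Signature& o) const { return (bits & o.bits).any(); }
        };

        struct TxSignatures {
            Signature reads, writes;
            // Conflict if my writes touch the other's reads or writes, or its
            // writes touch my reads; two read-only transactions never conflict.
            bool conflictsWith(const TxSignatures& o) const {
                return writes.intersects(o.reads) || writes.intersects(o.writes) ||
                       o.writes.intersects(reads);
            }
        };

        int main() {
            int x = 0, y = 0;
            TxSignatures a, b;
            a.writes.add(reinterpret_cast<std::uintptr_t>(&x));  // a wrote x
            b.reads.add(reinterpret_cast<std::uintptr_t>(&x));   // b read x
            b.reads.add(reinterpret_cast<std::uintptr_t>(&y));
            std::printf("conflict: %d\n", a.conflictsWith(b));   // prints 1
        }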

    Hardware support for Local Memory Transactions on GPU Architectures

    Graphics Processing Units (GPUs) are popular hardware accelerators for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction - Multiple Thread (SIMT) fashion. However, the SIMT execution model is not efficient when code includes critical sections that protect the access to data shared by the running threads. In addition, GPUs offer two spaces shared among threads: local memory and global memory. Typical solutions to thread synchronization include the use of atomics to implement locks, the serialization of the execution of the critical section, or delegating the execution of the critical section to the host CPU, all of which lead to suboptimal performance. In the multi-core CPU world, transactional memory (TM) was proposed as an alternative to locks for coordinating concurrent threads, and some solutions for GPUs have started to appear in the literature. In contrast to these earlier proposals, our approach is to design hardware support for TM at two levels. The first level is a fast and lightweight solution for coordinating threads that share the local memory, while the second level coordinates threads through the global memory. In this paper we present GPU-LocalTM, a hardware TM (HTM) support for the first level. GPU-LocalTM offers simple conflict detection and version management mechanisms that minimize the hardware resources required for its implementation. For the workloads studied, GPU-LocalTM provides between 1.25x and 80x speedup over serialized critical sections, while the overhead introduced by transaction management is lower than 20%. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
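
    One classic low-cost version management scheme, shown below as a software sketch with no claim that GPU-LocalTM implements exactly this, is eager writes plus an undo log: the old value is saved before each transactional write, a commit simply discards the log, and an abort replays it backwards to restore memory:

        #include <cstdio>
        #include <utility>
        #include <vector>

        struct UndoLog {
            std::vector<std::pair<int*, int>> entries;  // (address, old value)

            void transactional_write(int* addr, int value) {
                entries.emplace_back(addr, *addr);  // save the old value first
                *addr = value;                      // then write in place (eager)
            }
            void commit() { entries.clear(); }      // old values no longer needed
            void abort() {                          // restore in reverse order
                for (auto it = entries.rbegin(); it != entries.rend(); ++it)
                    *it->first = it->second;
                entries.clear();
            }
        };

        int main() {
            int a = 1, b = 2;
            UndoLog log;
            log.transactional_write(&a, 10);
            log.transactional_write(&b, 20);
            log.abort();                    // conflict detected: roll back
            std::printf("%d %d\n", a, b);   // prints "1 2" again
        }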

    Pipeline template for streaming applications on heterogeneous chips

    We address the problem of providing support for executing single streaming applications implemented as a pipeline of stages that run on heterogeneous chips comprised of several cores and one on-chip GPU. In this paper, we mainly focus on the API that allows the user to specify the type of parallelism exploited by each pipeline stage running on the multicore CPU, the mapping of the pipeline stages to the devices (GPU or CPU), and the number of active threads. We use a real streaming application as a case study to illustrate the experimental results that can be obtained with this API. With this example, we evaluate how the different parameter values affect the performance and energy efficiency of a heterogeneous on-chip processor (Exynos 5 Octa) that has three different computational cores: a GPU, an ARM Cortex-A15 quad-core, and an ARM Cortex-A7 quad-core. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech. Proyecto de Excelencia de la Junta de Andalucía P11-TIC-08144.
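
    The paper's own API is not reproduced here, but oneTBB's parallel_pipeline, from the same ecosystem this line of work builds on, gives a feel for the programming model: each stage declares its kind of parallelism (serial_in_order vs parallel), while the CPU/GPU mapping and thread count are the parameters the paper's template adds, indicated only as comments in this sketch (run and frames are names invented for the example):

        #include <oneapi/tbb/parallel_pipeline.h>
        #include <cstddef>
        #include <vector>

        void run(std::vector<int>& frames) {
            std::size_t next = 0;
            oneapi::tbb::parallel_pipeline(
                /*max_number_of_live_tokens=*/8,
                // Stage 1: serial input stage (reads the stream on the CPU)
                oneapi::tbb::make_filter<void, int>(
                    oneapi::tbb::filter_mode::serial_in_order,
                    [&](oneapi::tbb::flow_control& fc) -> int {
                        if (next == frames.size()) { fc.stop(); return 0; }
                        return frames[next++];
                    }) &
                // Stage 2: parallel transform stage (the stage a user could
                // map to the on-chip GPU in the paper's template)
                oneapi::tbb::make_filter<int, int>(
                    oneapi::tbb::filter_mode::parallel,
                    [](int x) { return x * x; }) &
                // Stage 3: serial output stage
                oneapi::tbb::make_filter<int, void>(
                    oneapi::tbb::filter_mode::serial_in_order,
                    [](int) { /* consume the result */ }));
        }

        int main() {
            std::vector<int> frames(1000, 3);  // a stand-in input stream
            run(frames);
        }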

    Assembly programming lab exercises based on the Raspberry Pi (Prácticas de ensamblador basadas en Raspberry Pi)

    This work is framed within the Educational Innovation Project PIE13-082, "Motivating engineering students through the Raspberry Pi platform", whose main goal is to increase the motivation of students taking courses taught by the Department of Computer Architecture. The proposed strategy builds on the fact that many engineering students feel that the courses in their degree are far removed from their everyday reality and therefore lose some of their appeal. However, a good number of these students have bought, or plan to buy, a Raspberry Pi minicomputer, which offers great functionality thanks to being based on a processor and operating system that are standard in mobile devices. In this project we propose to harness the interest that students already show in the Raspberry Pi platform and put it to work towards the following teaching objective: facilitating the study of concepts and techniques taught in several courses of the Department. More specifically, the main objective of this work is the creation of a set of lab exercises focused on learning assembly programming, specifically for the ARMv6, the processor of the platform used to develop the exercises, as well as on low-level handling of interrupts and input/output on that processor. This report is divided into five chapters and four appendices. Of the five chapters, the first is introductory. The next two focus on programming Linux executables, covering control structures in chapter 2 and subroutines (functions) in chapter 3. The last two chapters cover bare-metal programming, explaining the input/output subsystem (input/output ports and timers) of the Raspberry Pi platform and its low-level handling in chapter 4, and interrupts in chapter 5. In the appendices we have added side topics relevant enough to be included in the report: appendix A explains how the ADDEXC macro works, appendix B gives all the details of the auxiliary board, appendix C shows how to speed up the loading of bare-metal programs, and appendix D goes deeper into GPIO aspects such as the programmable resistors.

    Adaptive Partition Strategies for Loop Parallelism in Heterogeneous Architectures

    This work describes our contribution to the execution of parallel loops on multi-core/multi-GPU architectures so that the computational load is distributed in a balanced way among all the computing units. This paper explores the possibility of efficiently using multicores in conjunction with multiple GPU accelerators under a parallel task programming paradigm. In particular, we address the challenge of extending a parallel_for template to allow its exploitation on heterogeneous systems. The extension is based on a two-stage pipeline engine which is responsible for partitioning the iteration space into chunks and scheduling them onto the computational resources. Under this engine, we propose a dynamic scheduling strategy coupled with an adaptive partitioning heuristic that resizes chunks to prevent underutilization and load imbalance between CPUs and GPUs. In this paper we introduce the adaptive partitioning heuristic, which is derived from an analytical model that minimizes the load imbalance while maximizing the throughput of the system. Using two benchmarks, we evaluate the overhead introduced by our template extensions, finding that it is negligible. We also evaluate the efficiency of our adaptive partitioning strategies and compare them with related work. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech. TIN2010-16144, P08-TIC-3500, P11-TIC-0814.
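
    A sketch of the general idea behind such a dynamic strategy (the constants and the 10 ms chunk-time target are illustrative assumptions, not the paper's analytical model): a shared counter hands out chunks, and each worker resizes its next chunk from its measured throughput, so faster devices take larger chunks and the tail of the iteration space stays balanced:

        #include <algorithm>
        #include <atomic>
        #include <chrono>
        #include <thread>
        #include <vector>

        std::atomic<long> next_iter{0};

        // Grab the next chunk of work; returns false when the range is done.
        bool get_chunk(long total, long size, long& begin, long& end) {
            begin = next_iter.fetch_add(size);
            if (begin >= total) return false;
            end = std::min(begin + size, total);
            return true;
        }

        template <typename Body>
        void worker(long total, Body body) {
            long chunk = 64;  // initial guess, adapted below
            long b, e;
            while (get_chunk(total, chunk, b, e)) {
                auto t0 = std::chrono::steady_clock::now();
                for (long i = b; i < e; ++i) body(i);
                double secs = std::max(std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count(), 1e-9);
                double throughput = (e - b) / secs;  // iterations per second
                // Resize so the next chunk takes ~10 ms on this device,
                // clamped to bound scheduling overhead and load imbalance.
                chunk = std::clamp<long>(static_cast<long>(throughput * 0.01),
                                         16, total / 8 + 1);
            }
        }

        int main() {
            std::vector<double> out(1 << 20);  // a stand-in iteration space
            auto body = [&](long i) { out[i] = i * 0.5; };
            std::thread t1(worker<decltype(body)>, (long)out.size(), body);
            std::thread t2(worker<decltype(body)>, (long)out.size(), body);
            t1.join(); t2.join();
        }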

    C++ for Heterogeneous Programming: oneAPI (DPC++ and oneTBB)

    This tutorial provides hands-on experience programming CPUs, GPUs and FPGAs using a unified, standards-based programming model: oneAPI. oneAPI includes a cross-architecture language, Data Parallel C++ (DPC++), an evolution of C++ that incorporates the SYCL language with extensions for Unified Shared Memory (USM), ordered queues and reductions, among other features. oneAPI also includes libraries for API-based programming, such as domain-specific libraries, math kernel libraries and Threading Building Blocks (TBB). The main benefit of using oneAPI over other heterogeneous programming models is its single-language approach, which enables targeting multiple devices with the same programming model and therefore leads to cleaner, more portable, and more readable code. In the current heterogeneous era, it is still challenging for developers to match computations to accelerators and to coordinate the use of those accelerators in the context of their larger applications. Therefore, this tutorial's main goal is not just to teach oneAPI as an easier approach to targeting heterogeneous platforms, but also to convey techniques for mapping applications to heterogeneous hardware, paying attention to the scheduling and mapping problems (how to achieve load balance and which regions of the application are better suited to each particular device). Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
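
    A minimal DPC++/SYCL example of two of the features named above, a queue and Unified Shared Memory; the default device selection and the array size are arbitrary choices for illustration:

        #include <sycl/sycl.hpp>
        #include <cstdio>

        int main() {
            sycl::queue q;  // default selector: CPU, GPU or FPGA backend
            constexpr size_t n = 1024;
            // USM: one pointer visible to both host and device
            int* data = sycl::malloc_shared<int>(n, q);
            for (size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i);

            q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                data[i] *= 2;  // runs on the selected device
            }).wait();

            std::printf("data[10] = %d\n", data[10]);  // host reads same pointer
            sycl::free(data, q);
        }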

    Energy Efficiency of Software Transactional Memory in a Heterogeneous Architecture

    Hardware vendors make an important effort to create low-power CPUs that keep battery life and durability above acceptable levels. In order to achieve this goal while providing good performance and energy efficiency for a wide variety of applications, ARM designed the big.LITTLE architecture. This heterogeneous multi-core architecture features two different types of cores: big cores oriented towards performance, and little cores, slower and aimed at saving energy. As all the cores have access to the same memory, multi-threaded applications must resort to some mutual exclusion mechanism to coordinate the access to shared data by the concurrent threads. Transactional Memory (TM) represents an optimistic approach to shared-memory synchronization. To take full advantage of the features offered by software TM, while also benefiting from the characteristics of heterogeneous big.LITTLE architectures, our focus is to propose TM solutions that take into account both the power/performance requirements of the application and what the architecture offers. In order to understand the current state of the art and obtain useful information for future power-aware software TM solutions, we have performed an analysis of a popular TM library running on top of an ARM big.LITTLE processor. Experiments show, in general, better scalability on the LITTLE cores for most of the applications, except for one that requires the computing performance that the big cores offer. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
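
    A sketch of the kind of measurement setup such an analysis needs on Linux: pinning worker threads to one cluster or the other before running the transactional workload, so scalability and energy can be compared per cluster. The core numbering (0-3 LITTLE, 4-7 big) follows common big.LITTLE layouts but is an assumption to verify on the target machine, and run_tm_benchmark is a placeholder name:

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <thread>

        // Pin the calling thread to a contiguous range of cores.
        void pin_to_cores(int first, int last) {
            cpu_set_t set;
            CPU_ZERO(&set);
            for (int c = first; c <= last; ++c) CPU_SET(c, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }

        int main() {
            std::thread little([] {
                pin_to_cores(0, 3);     // assumed LITTLE cluster
                // run_tm_benchmark();  // transactional workload goes here
            });
            std::thread big([] {
                pin_to_cores(4, 7);     // assumed big cluster
                // run_tm_benchmark();
            });
            little.join();
            big.join();
        }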

    Time Series Heterogeneous Co-execution on CPU+GPU

    Time series motif (similarity) and discord discovery is one of the most important and challenging problems in time series analytics today. We use an algorithm called "scrimp" that excels at collecting the relevant information of a time series while reducing the computational complexity of the search. Starting from the sequential algorithm, we develop parallel alternatives based on a variety of scheduling policies that target the different computing devices of a system integrating a CPU multicore and an embedded GPU. These policies are named Dynamic (using Intel TBB) and Static (using C++11 threads) when targeting the CPU, and they are compared with a heterogeneous adaptive approach named LogFit (using Intel TBB and OpenCL) when targeting co-execution on the CPU and GPU. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
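
    A sketch of the Static policy in the terms described above: the iteration space is split into equal ranges, one per C++11 thread. The compute_range body is a numeric stand-in for the per-iteration work of the algorithm, not the actual kernel:

        #include <algorithm>
        #include <cmath>
        #include <cstddef>
        #include <thread>
        #include <vector>

        std::vector<double> profile;  // result array being filled (illustrative)

        // Stand-in for the real per-iteration computation (an assumption,
        // not the paper's kernel): each index gets some numeric work.
        void compute_range(std::size_t begin, std::size_t end) {
            for (std::size_t i = begin; i < end; ++i)
                profile[i] = std::sqrt(static_cast<double>(i));
        }

        // Static schedule: equal contiguous ranges, one per thread.
        void static_schedule(std::size_t total, unsigned nthreads) {
            std::vector<std::thread> pool;
            std::size_t per = (total + nthreads - 1) / nthreads;  // ceil division
            for (unsigned t = 0; t < nthreads; ++t) {
                std::size_t b = t * per;
                std::size_t e = std::min(b + per, total);
                if (b >= e) break;
                pool.emplace_back(compute_range, b, e);
            }
            for (auto& th : pool) th.join();
        }

        int main() {
            profile.resize(1 << 20);
            static_schedule(profile.size(),
                            std::max(1u, std::thread::hardware_concurrency()));
        }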